Using the Correlation Intensity Index to Build a Model of Cardiotoxicity of Piperidine Derivatives

The assessment of cardiotoxicity is a persistent problem in medicinal chemistry. Quantitative structure–activity relationships (QSAR) are one possible way to build up models for cardiotoxicity. Here, we describe the results obtained with the Monte Carlo technique to develop hybrid optimal descriptors correlated with cardiotoxicity. The predictive potential of the cardiotoxicity models (pIC50, Ki in nM) of piperidine derivatives obtained using this approach provided quite good determination coefficients for the external validation set, in the range of 0.90–0.94. The results were best when applying the so-called correlation intensity index, which improves the predictive potential of a model.


Introduction
The risk of developing cardiotoxicity during the treatment of oncological diseases is one of the most urgent problems of modern oncology and cardiology. Piperidine derivatives are of exceptional interest due to their potential biological activities, such as antiviral, antibacterial, antitumor, and many others [1].
Current anticancer therapy includes many drugs with various mechanisms and spectra of action. One of the most important groups among these drugs is antibiotics with antitumor activity, which play a major role in the chemotherapy of various oncological diseases [2]. Their effectiveness has been clinically proven. However, even when the disease takes a favorable course, patients receiving this group of drugs experience a number of undesirable side effects in various organs and systems, which can develop not only during therapy but also after its completion. One of the main side effects is cardiotoxicity. This term covers various adverse cardiovascular events that arise during drug therapy for oncological diseases. Manifestations of cardiotoxicity such as pain in the heart, changes in blood pressure, heart rhythm disturbances, myocarditis, pericarditis, and heart attacks reduce a patient's quality of life, and sometimes they become serious reasons for discontinuing or not prescribing the drug. For some medicines in this group, for example, alkylating agents, cardiotoxicity is a limiting factor.
Antibiotics with antitumor activity currently occupy a leading place in the treatment of oncological diseases [3]; as a result, the correction of their side effects, particularly cardiotoxicity, remains one of the most urgent problems for oncologists, cardiologists, and general practitioners. The incidence of various cardiac dysfunctions is high, and both reversible and irreversible consequences are quite dangerous. Prevention and treatment of cardiotoxicity remain mandatory but complicated clinical tasks for a doctor because most of the changes in the functioning of the cardiovascular system are irreversible and progressive. Cardiotoxic complications significantly impair patients' quality of life and reduce life expectancy, and cardiovascular diseases still rank first globally as a cause of mortality [4].
In addition, psychopharmacology and the psychopharmacotherapy of depressive states are dynamically developing areas, and antidepressants are the second most prescribed drugs among all psychotropic drugs [5]. This high ranking reflects the fact that about 5% of the world's population suffers from depression (according to WHO). However, high doses and long-term use of medications in this group lead to cardiotoxic effects. The cardiotoxicity of tricyclic antidepressants manifests as conduction disturbances in the atrioventricular node and ventricles of the heart (quinidine-like action), arrhythmias, and a decrease in myocardial contractility. Doxepin and amoxapine have the least cardiotoxicity. Treatment of patients with cardiovascular pathology with tricyclic antidepressants should be carefully monitored, and high doses should not be used.
Cardiovascular diseases such as coronary artery disease, valvular heart disease, arrhythmias, and heart failure are serious health risks and often require lifelong treatment. Much attention is paid to the psychological consequences of living with cardiac diseases [6]. Anxiety symptoms are common among patients with cardiovascular diseases and may worsen the prognosis for these patients. Symptoms of anxiety and depression can prevent lifestyle changes and adherence to therapy, as well as reduce the effectiveness of cardiac rehabilitation.
There is increasing evidence of the widespread use of psychotropic drugs in cardiac patients for comorbid psychiatric disorders [7]. The cardiovascular side effects of psychotropic medications include disturbances in the rhythm and conduction of the heart. For example, recent studies have shown that antidepressants were associated with increased mortality and an altered beta-blocker effect in patients with heart failure. In addition, the use of antipsychotic drugs in patients after acute myocardial infarction is necessary [6,7].
An increase in morbidity and mortality under the influence of depression in patients with cardiovascular diseases dictates the imperative need for preliminary analysis of both the therapeutic efficacy and the cardiotoxic potential of drugs [8]. In other words, when developing a treatment plan for a depressed patient with heart disease, one should carefully weigh the risk/benefit ratio of any intervention. However, the choice of antidepressants is complicated because many can cause cardiovascular side effects, such as orthostatic hypotension, hypertension, and impaired cardiac conduction. In addition, clinically significant drug interactions should be considered when choosing treatment. Unfortunately, the number of clinical studies explicitly investigating the safety of antidepressants in patients with cardiovascular disease is limited, and the studies that have been conducted have generally included a small number of patients.
The human ether-a-go-go-related gene (hERG) potassium channel plays a pivotal role in cardiac rhythm regulation, and the cardiotoxicity data associated with hERG inhibition by drugs and environmental chemicals provide important information for medicinal chemistry [9,10]. As stated above, cardiac problems are among the most complex in medicine [11], and there is a clear trend toward them becoming more important [12]. Identifying potential hERG potassium channel blockers is an important part of drug discovery and drug safety assessment in the pharmaceutical industry and in academic drug discovery centers [10]. A popular current strategy is to begin the corresponding searches from an idealized reference, that is, a molecule that embodies the preferred qualities in full. In order to prioritize molecules during the early drug discovery phase and to reduce the risk of needing additional preliminary testing of pharmaceutical agents, computational approaches have been developed to predict the hERG-blocking potential of new drug candidates. In other words, estimating the cardiac toxicity of organic hERG blockers is an important theoretical and practical task of medicinal chemistry. Potential hERG inhibitors must be identified for drug discovery and safety [13]; however, the experimental analysis of all potential hERG inhibitors is impossible because there are so many of them.
Computational models for the cardiac toxicity of organic hERG blockers are an attractive alternative to real experiments [14]. Quantitative structure–activity relationships (QSAR) are common computational approaches [15]. Such models can be obtained using machine learning based on graph theory, support vector machines, random forests, artificial neural networks, and other approaches [16,17].
The index of ideality of correlation (IIC) differs from other criteria of the statistical quality of linear regression models in that it is sensitive both to the value of the correlation coefficient and to the value of the mean absolute error (MAE).
In principle, the correlation intensity index (CII) has some analogy with the known cross-validation measures, but this analogy is partial. While the traditional cross-validated test is based on averaging the difference between the correlation coefficient before and after the "removing" of molecules from the set (training, calibration, or testing), the CII considers the difference observed when removing only those molecules that reduce the correlation coefficient of the set.
Here, the ability to improve the predictive potential of cardiotoxicity models using the IIC and CII is studied.

QSAR Models Based on TF 1
The Monte Carlo optimization with the target function TF1 for three random splits (#1, #2, and #3) provides the models whose statistical quality is given in Table 1. The Monte Carlo optimization with the target function TF2 for three random splits (#1, #2, and #3) provides the models whose statistical quality is given in Table 2. The average value of the coefficient of determination for these models is about 0.6 (for the set as a whole). However, there is a paradox described earlier [28]: the influence of the IIC calculated on the calibration set (IIC_C) leads to an improvement in the coefficient of determination for the calibration and validation sets but to the detriment of the training sets, where the coefficient of determination is lower.
Figure 1 compares models calculated with the target functions TF1 and TF2. Models calculated using TF2 are preferred since Figure 2 confirms that the average determination coefficient values of TF2-models are larger than those of TF1-models for all three random splits. Williams plots for all considered models indicate that there are practically no outliers for either TF1-models or TF2-models (Figure 3).

Discussion
The models' advantage is their user-friendliness, since their implementation requires only SMILES and numerical data for an endpoint, without any other descriptors. There are special rules to define the mechanistic interpretation as well as the applicability domain. The described approach provides models that follow the OECD principles [30,31]. The essence of the OECD guidance, concentrated in the well-known five principles [26,30–32], is as follows:

• A defined endpoint;
• An unambiguous algorithm;
• A defined applicability domain;
• Appropriate measures of goodness-of-fit, robustness, and predictivity;
• A mechanistic interpretation, if possible.

Table 3 lists the molecular features that are statistically stable promoters of an increase or decrease of the pIC50. These data are selected according to the following: (i) molecular features extracted from SMILES or the hydrogen-suppressed graph (HSG) with significant prevalence in the training and calibration sets; (ii) molecular features that have positive correlation weights (CW) for all three runs of the Monte Carlo optimization; (iii) molecular features with negative CW for all three runs of the Monte Carlo optimization. There are stable promoters of the pIC50 increase common to all distributions. For instance, the promoters of an increase of pIC50 are the presence of nitrogen connected with carbon, Morgan extended connectivity of carbon atoms equal to 5, 6, or 7, and Morgan extended connectivity of nitrogen atoms equal to 4. In contrast, promoters of a decrease of pIC50 are vertex degrees of carbon atoms equal to 2 or 3 and vertex degrees of nitrogen atoms equal to 2. Some other features are also promoters of an increase or decrease of cardiotoxicity (Table 3). The comparison of the statistical quality of the models built using the target functions TF1 and TF2, presented in Tables 1 and 2, indicates that TF2 provides better results.
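Selection rules (ii) and (iii) for promoters amount to a sign check across the optimization runs. The snippet below is a minimal sketch of that check, not the CORAL implementation: the function name `classify_promoters`, the feature labels, and the weights are hypothetical, and the prevalence criterion (i) is assumed to have been applied beforehand.

```python
def classify_promoters(cw_runs):
    """Split features into promoters of increase and of decrease.

    cw_runs maps a feature label to its correlation weights,
    one value per Monte Carlo optimization run.
    """
    # Promoter of increase: positive weight in every run.
    increase = sorted(f for f, ws in cw_runs.items() if all(w > 0 for w in ws))
    # Promoter of decrease: negative weight in every run.
    decrease = sorted(f for f, ws in cw_runs.items() if all(w < 0 for w in ws))
    return increase, decrease


# Hypothetical feature labels and weights, for illustration only.
example_weights = {
    "feature_a": [0.8, 1.1, 0.4],    # stable positive
    "feature_b": [-0.3, -0.9, -0.2],  # stable negative
    "feature_c": [0.5, -0.1, 0.7],    # unstable sign, not a promoter
}
promoters_up, promoters_down = classify_promoters(example_weights)
```

Features whose weights change sign between runs fall into neither group, which is what makes the remaining promoters "statistically stable".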
Table 4 contains the comparison of models for cardiotoxicity suggested in the literature. The best model is observed for TF1 (split-1); however, the results for the other two splits in the case of the TF1-model are worse, and the variance of the coefficient of determination for the validation set is significant. In contrast, the average value of the coefficient of determination for the validation set in the case of the TF2-model is larger, and the variance is smaller than in the case of the TF1-model. Thus, despite the excellent result for split-1 with the TF1-model, on the whole, the TF2-model is preferable. The above-mentioned information allows us to state that the proposed models correspond to the five generally recognized principles of constructing a QSPR/QSAR model. However, it seems appropriate to dwell on a number of features of the considered method.
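A very useful feature of the approach under consideration is its significant heuristic potential, since it allows statistical hypotheses to be formulated approximately as follows:
- Whether (and if so, how much) the considered endpoint depends on the representation of molecules using SMILES;
- Whether (and if so, to what extent) the considered endpoint depends on the representation of molecules using graphs;
- Whether the molecular features extracted from SMILES and from the graph provide a synergetic effect (i.e., improving the predictive potential of a model in comparison with the separate cases of the SMILES-based model and the graph-based model);
- Whether IIC improves the predictive potential of models based on a SMILES-based representation of molecules;
- Whether IIC improves the predictive potential of models based on a graph-based representation of molecules;
- Whether CII improves the predictive potential of models based on a SMILES-based representation of molecules;
- Whether CII improves the predictive potential of models based on a graph-based representation of molecules;
- Whether the combined use of IIC and CII has a synergistic effect, that is, whether the predictive potential of models improves when IIC and CII are applied together compared to the cases of using IIC and CII separately.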
In principle, the list of similar hypotheses that can be formulated and, accordingly, tested within the framework of the approach under consideration can also be expanded. However, it seems more appropriate to consider the mentioned possibilities, providing them with brief explanations.
In fact, only a part of the hypotheses listed above is considered here. The results can be formulated as follows:
1. The combined use of correlation weighting of SMILES attributes and graph invariants improves the predictive potential of the hERG inhibition model expressed as pIC50;
2. For the considered compounds, the use of CII provides a better predictive potential than that of models built using IIC;
3. The observed statistical results for the three random splits of the available compounds into the training and validation sets are in good agreement with each other.
Are there valuable models? If there are "valuable" models, then there must be models that are not "valuable". How can valuable models be distinguished from not very valuable ones? It has been stated that "All models are wrong, but some are useful" [35]. Thus, how can useful models be distinguished from a set of wrong ones? The reproducibility of results and their clarity (graphical representation [36]) are most likely the main features of a useful model. In this paper, for this purpose, several models were built using different splits. The development of criteria for the predictive potential of models is also part of the research designed to identify useful models. In this paper, for this purpose, two new criteria for the predictive potential of a model, the IIC (TF1) and the CII (TF2), were compared.
Two basic components can be distinguished in the large variety of QSAR studies: (i) "applicative" studies and (ii) "theoretical" studies. "Applicative" studies aim to integrate the results of applying current approaches to solve practical tasks. "Theoretical" studies aim to develop new conceptions of QSPR/QSAR analysis. This study contains both applicative and theoretical parts. On the one hand, the Monte Carlo optimization technique described in the literature is used here to build (almost) standard models (applicative part). On the other hand, new criteria of the predictive potential are studied (theoretical part).
Thus, the epistemological aspect of the provided QSAR research, here, is presented in the form of confirmation of two statements.First, all QSAR models are random events if they are built using random distributions in training and validation sets.Second, the usefulness of random QSAR models can be stated if the variance in the values of statistical characteristics is acceptably small.
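The second statement can be made operational by summarizing a statistical characteristic over the random splits. The snippet below is a minimal sketch of such a summary. The function name `split_stability`, the per-split values, and the tolerance `max_stdev` are hypothetical and chosen only for illustration; the text does not specify what counts as "acceptably small".

```python
from statistics import mean, stdev


def split_stability(r2_values, max_stdev):
    """Summarize a statistic over random splits.

    Returns the mean, the sample standard deviation, and a flag
    telling whether the spread is within the chosen tolerance.
    """
    m = mean(r2_values)
    s = stdev(r2_values)  # sample standard deviation, needs >= 2 values
    return m, s, s <= max_stdev


# Hypothetical validation-set determination coefficients for three splits.
r2_validation = [0.90, 0.92, 0.94]
avg, spread, stable = split_stability(r2_validation, max_stdev=0.05)
```

Under this reading, a model family is judged by the pair (mean, spread) over splits rather than by its best single split.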
The Supplementary Materials section contains the technical details related to the described approach.

Data
The numerical data on 113 piperidine derivatives (pyridine-substituted piperidines, tertiary alcohol-bearing piperidines, spirocyclic piperidines, and isoxazole-containing piperidines) were taken from the literature [9]. The activity is expressed as −logIC50 or pIC50 [9]. The set of compounds is split into (i) active training (≈25%), (ii) passive training (≈25%), (iii) calibration (≈25%), and (iv) validation (≈25%) sets. Each set has a defined task. The active training set is used to build the model: molecular features extracted from the simplified molecular-input line-entry system (SMILES) representations [28,29,37] of the compounds in the active training set are involved in the Monte Carlo optimization, which provides the correlation weights for these features that give the largest target function value on the active training set. The passive training set checks whether the model for the active training set is satisfactory for SMILES that were not included in the active training set. The calibration set should detect when overtraining (overfitting) starts. The validation set makes it possible to assess the predictive potential of a model, since the data from the validation set are not used while building the model. Our experience with CORAL shows that equal distribution over the four sets mentioned is likely the most rational strategy.
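A random split into four sets of roughly equal size can be sketched as follows. This is only an illustration of the splitting scheme, not the procedure implemented in CORAL; the function name `split_four_ways` and the use of integer indices in place of compounds are assumptions.

```python
import random


def split_four_ways(items, seed=0):
    """Shuffle items and deal them into four roughly equal sets."""
    rng = random.Random(seed)  # seeded so that a split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    names = ("active_training", "passive_training", "calibration", "validation")
    # Dealing round-robin keeps the set sizes within one item of each other.
    return {name: shuffled[i::4] for i, name in enumerate(names)}


# 113 compounds, represented here by their indices 0..112.
splits = split_four_ways(range(113), seed=1)
sizes = {name: len(members) for name, members in splits.items()}
```

Changing `seed` yields a different random split, which is how several splits of the same data set can be generated and compared.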
At the beginning of the optimization, the correlation coefficients between the experimental values of the endpoint and the descriptor increase simultaneously for all sets. At some point, the correlation coefficient for the calibration set reaches a maximum; this is the start of overtraining, and further optimization leads to a decrease of the correlation coefficient for the calibration set. Optimization should be stopped when overtraining starts. After the Monte Carlo optimization procedure is stopped, the validation set is used to assess the model's predictive potential.
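The stopping rule can be illustrated with a short sketch that finds the epoch at which the calibration-set correlation coefficient peaks. The function name `stopping_epoch` and the trace of values are hypothetical; in practice the trace would be recorded during the optimization.

```python
def stopping_epoch(calibration_r):
    """Return the 1-based epoch at which the calibration correlation peaks."""
    best_epoch, best_r = 0, float("-inf")
    for epoch, r in enumerate(calibration_r, start=1):
        if r > best_r:  # strict comparison keeps the earliest maximum
            best_epoch, best_r = epoch, r
    return best_epoch


# Hypothetical calibration-set correlation coefficients, one per epoch.
trace = [0.41, 0.55, 0.63, 0.70, 0.69, 0.66]
stop_at = stopping_epoch(trace)
```

For this trace the maximum is reached at the fourth epoch; later epochs only lower the calibration correlation, which is the signature of overtraining described above.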

Optimal Descriptor
The optimal descriptor, calculated with the representation of the molecular structure using SMILES [37], serves as the basis of a model for cardiotoxicity. The optimal descriptor for the predictive model of the endpoint is calculated with Equation (7), where T is an integer threshold that separates molecular features extracted from SMILES into rare and non-rare ones. The non-rare features serve to build up the model. The rare features are not used to build up the model. N is the number of epochs in the optimization of the correlation weights. S_k is a SMILES atom, i.e., one SMILES line symbol (e.g., '=', 'O') or a group of symbols that cannot be examined separately (e.g., 'Cu', '%11'). SS_k is a pair of SMILES atoms. CW(S_k) and CW(SS_k) are the correlation weights of the SMILES attributes (SA_k). NA is the number of non-rare SMILES attributes.
The Supplementary Materials (Table S1) contain an example of the DCW(1, 15) calculation.
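The SMILES-based part of such a descriptor can be sketched as a sum of correlation weights over SMILES atoms and pairs of neighboring SMILES atoms. The code below is an illustration under stated assumptions, not the CORAL implementation: the tokenizer recognizes only a small illustrative subset of multi-character tokens ('%11'-style ring closures, 'Cl', 'Br', 'Cu'), pairs are formed by simple concatenation of neighbors, features absent from the weight table (for example, rare ones) are assumed to contribute zero, and any graph-based terms of the hybrid descriptor are omitted. The names `smiles_atoms` and `dcw` and the weights are hypothetical.

```python
import re

# Multi-character tokens first, then any single character.
TOKEN = re.compile(r"%\d{2}|Cl|Br|Cu|.")


def smiles_atoms(smiles):
    """Split a SMILES string into SMILES atoms (S_k)."""
    return TOKEN.findall(smiles)


def dcw(smiles, cw):
    """Sum correlation weights over SMILES atoms and neighboring pairs."""
    atoms = smiles_atoms(smiles)
    pairs = [a + b for a, b in zip(atoms, atoms[1:])]  # SS_k
    # Features without a weight (e.g., rare ones) contribute zero.
    return sum(cw.get(feature, 0.0) for feature in atoms + pairs)


# Hypothetical correlation weights, for illustration only.
weights = {"C": 0.5, "=": -0.25, "O": 1.0, "C=": 0.25}
value = dcw("C=O", weights)
```

A model of the endpoint is then a one-variable regression on this descriptor, so the whole model is determined by the weight table and two regression coefficients.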

Monte Carlo Optimization
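Equation (2) needs the numerical data on the above correlation weights. Monte Carlo optimization is employed to calculate the correlation weights. Here, two target functions for the Monte Carlo optimization are examined:

$$TF_1 = TF_0 + IIC_C \times 0.5 \qquad (10)$$

Equation (3) is defined empirically during the development of many different models. Variables $r_{AT}$ and $r_{PT}$ are correlation coefficients between the observed and predicted values of the endpoint for the active training set and passive training set, respectively. $IIC_C$ is the index of ideality of correlation [28]. $IIC_C$ is calculated with data on the calibration set as follows:

$$IIC_C = r_C \times \frac{\min\left({}^{-}MAE_C,\ {}^{+}MAE_C\right)}{\max\left({}^{-}MAE_C,\ {}^{+}MAE_C\right)}$$

$$\min(x, y) = \begin{cases} x, & \text{if } x < y \\ y, & \text{otherwise} \end{cases} \qquad \max(x, y) = \begin{cases} x, & \text{if } x > y \\ y, & \text{otherwise} \end{cases}$$

$${}^{-}MAE_C = \frac{1}{{}^{-}N} \sum_{\Delta_k < 0} \left|\Delta_k\right|, \qquad {}^{+}MAE_C = \frac{1}{{}^{+}N} \sum_{\Delta_k \geq 0} \left|\Delta_k\right|$$

$$\Delta_k = observed_k - calculated_k$$

Here, $r_C$ is the correlation coefficient for the calibration set, ${}^{-}N$ and ${}^{+}N$ are the numbers of compounds with $\Delta_k < 0$ and $\Delta_k \geq 0$, respectively, and $observed_k$ and $calculated_k$ are the corresponding observed and calculated values of the endpoint.

The correlation intensity index (CII), similar to the IIC, was developed to improve the quality of the Monte Carlo optimization used to build up QSPR/QSAR models. CII is calculated as follows:

$$CII = 1 - \sum_{k=1}^{n} Protest_k$$

$$Protest_k = \begin{cases} R_k^2 - R^2, & \text{if } R_k^2 - R^2 > 0 \\ 0, & \text{otherwise} \end{cases}$$

where $R^2$ is the correlation coefficient for a set that contains n substances. $R_k^2$ is the correlation coefficient for the n − 1 substances of the set after removing the k-th substance. If $(R_k^2 - R^2)$ is larger than zero, the k-th substance is an "opponent" of the correlation between the experimental and predicted values of the set. A small sum of "protests" means a more "intense" correlation.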
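Both criteria can be computed directly from paired observed and calculated endpoint values. The sketch below follows the verbal definitions in this section: the IIC scales the correlation coefficient by the ratio of the smaller to the larger of the mean absolute errors for negative and non-negative residuals, and the CII subtracts from one the sum of leave-one-out "protests". It is a minimal illustration, not the CORAL code. The function names are hypothetical, the return value of zero for one-sided residuals is an assumption, and TF0 is taken as an input because its form is not reproduced here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient (inputs must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def iic(observed, calculated):
    """Index of ideality of correlation for one set."""
    r = pearson_r(observed, calculated)
    deltas = [o - c for o, c in zip(observed, calculated)]
    neg = [abs(d) for d in deltas if d < 0]
    pos = [abs(d) for d in deltas if d >= 0]
    if not neg or not pos:
        return 0.0  # assumption: one-sided residuals give a zero ratio
    mae_neg = sum(neg) / len(neg)
    mae_pos = sum(pos) / len(pos)
    return r * min(mae_neg, mae_pos) / max(mae_neg, mae_pos)


def cii(observed, calculated):
    """Correlation intensity index: one minus the sum of 'protests'."""
    r2 = pearson_r(observed, calculated) ** 2
    protests = 0.0
    for k in range(len(observed)):
        # Leave the k-th substance out and recompute R^2.
        obs_k = observed[:k] + observed[k + 1:]
        calc_k = calculated[:k] + calculated[k + 1:]
        r2_k = pearson_r(obs_k, calc_k) ** 2
        if r2_k > r2:  # the k-th substance is an "opponent"
            protests += r2_k - r2
    return 1.0 - protests


def tf1(tf0, iic_c):
    """TF1 = TF0 + IIC_C * 0.5, with TF0 supplied by the caller."""
    return tf0 + 0.5 * iic_c
```

The inputs are expected to be Python lists (or tuples) of equal length with at least four entries, so that every leave-one-out subset still has a defined correlation coefficient.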

Applicability Domain
The applicability domain of the described models is defined by the "statistical defects" of molecular features extracted from SMILES or HSG. These are calculated as follows:

$$d_k = \frac{\left|P(A_k) - P'(A_k)\right|}{N(A_k) + N'(A_k)} + \frac{\left|P(A_k) - P''(A_k)\right|}{N(A_k) + N''(A_k)} + \frac{\left|P'(A_k) - P''(A_k)\right|}{N'(A_k) + N''(A_k)}$$

where $P(A_k)$, $P'(A_k)$, and $P''(A_k)$ are the probabilities of $A_k$ in the active training, passive training, and calibration sets, respectively; $N(A_k)$, $N'(A_k)$, and $N''(A_k)$ are the frequencies of $A_k$ in the active training, passive training, and calibration sets, respectively. The statistical SMILES defects ($D_j$) are calculated as follows:

$$D_j = \sum_{k=1}^{NA} d_k$$

where NA is the number of non-blocked SMILES attributes in the SMILES. A SMILES falls in the applicability domain if

$$D_j < 2 \times \bar{D}$$

where $\bar{D}$ is the average statistical defect on all compounds.
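The defect calculation can be sketched in a few lines. The form used below, a sum of three pairwise terms (absolute difference of probabilities divided by the sum of frequencies) and a threshold of twice the average defect, is the one commonly given in the CORAL literature and is assumed here. The probability of a feature is taken as its frequency divided by the size of the set, which is also an assumption, and all function names and numbers are hypothetical.

```python
def feature_defect(n_a, n_p, n_c, size_a, size_p, size_c):
    """Statistical defect d_k of one feature.

    n_a, n_p, n_c: frequencies of the feature in the active training,
    passive training, and calibration sets; size_*: sizes of those sets.
    """
    # Assumption: probability = frequency / number of SMILES in the set.
    p_a, p_p, p_c = n_a / size_a, n_p / size_p, n_c / size_c

    def term(p1, p2, n1, n2):
        total = n1 + n2
        return abs(p1 - p2) / total if total else 0.0

    return (term(p_a, p_p, n_a, n_p)
            + term(p_a, p_c, n_a, n_c)
            + term(p_p, p_c, n_p, n_c))


def smiles_defect(feature_defects):
    """Statistical defect D_j of one SMILES: sum over its features."""
    return sum(feature_defects)


def in_domain(d_j, all_defects):
    """A SMILES is inside the domain if D_j < 2 * mean defect."""
    return d_j < 2 * (sum(all_defects) / len(all_defects))


# Hypothetical frequencies and set sizes, for illustration only.
d_example = feature_defect(10, 5, 5, 20, 20, 20)
```

Features that occur with similar probability in all three sets receive small defects, so a SMILES built mostly from such features is more likely to fall inside the domain.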

Figure 1. Examples of models calculated using target functions TF1 and TF2.

Figure 2. The comparison of determination coefficients for the active training and the validation sets using TF1 and TF2 for splits 1-3.

Figure 3. Williams plots of models calculated using target functions TF1 and TF2 for splits 1-3. The training set is the union of the active and passive training sets together with the calibration set; the compounds of the external test set are indicated in red.

Figure 4. The influence of presence/absence of promoters on the calculated cardiotoxicity.

Table 1. The statistical quality of the QSAR models for cardiotoxicity obtained using Monte Carlo optimization with target function TF1.
* A = active training set; P = passive training set; C = calibration set; V = validation set.

Table 2. The statistical quality of the QSAR models for cardiotoxicity obtained using Monte Carlo optimization with target function TF2.

Table 3. Promoters of an increase or decrease of cardiotoxicity (pIC50) observed in computational experiments with split-1.
* NA, NP, and NC are the frequencies of a molecular feature in the active training, passive training, and calibration sets, respectively.

Table 4. The comparison of QSAR models for cardiotoxicity suggested in the literature.